import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Read data
Start with the penguins dataset again.
penguins = sns.load_dataset("penguins")
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            344 non-null    object
 1   island             344 non-null    object
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
WHY PCA?
The penguins data has 4 numeric columns!
sns.pairplot(data=penguins)
plt.show()
/Applications/anaconda3/envs/cmpinf2100/lib/python3.8/site-packages/seaborn/axisgrid.py:123: UserWarning: The figure layout has changed to tight self._figure.tight_layout(*args, **kwargs)
There is a clear relationship, or correlation structure, between several of the numeric columns!
fig, ax = plt.subplots()
sns.heatmap(data=penguins.corr(numeric_only=True),
vmin=-1,
vmax=1,
center=0,
annot=True,
annot_kws={"fontsize":15},
cmap="coolwarm",
ax=ax)
plt.show()
BUT WHY PCA??
PCA tries to EXPLOIT correlation between variables. This is beneficial because maybe we do NOT actually need to look at ALL pairs of scatter plots!
If we can exploit the RELATIONSHIP between variables, maybe we can CREATE NEW variables that CAPTURE the impact or influence of ALL variables!!!
Then, instead of having to explore a large number of figures, we can focus on the relationship between a few NEWLY created variables!
PCA will be discussed in more detail in CMPINF 2120. We will also revisit PCA later in the semester in this course, CMPINF 2100. But for now, let's just see how to USE PCA to support visualization.
Executing PCA
Before executing PCA, we MUST deal with MISSING values, for example by DROPPING them! Also, it is HIGHLY RECOMMENDED that you STANDARDIZE the variables BEFORE applying PCA!
pens_clean = penguins.dropna().copy()
pens_clean.info()
<class 'pandas.core.frame.DataFrame'>
Index: 333 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   species            333 non-null    object
 1   island             333 non-null    object
 2   bill_length_mm     333 non-null    float64
 3   bill_depth_mm      333 non-null    float64
 4   flipper_length_mm  333 non-null    float64
 5   body_mass_g        333 non-null    float64
 6   sex                333 non-null    object
dtypes: float64(4), object(3)
memory usage: 20.8+ KB
STANDARDIZE using the StandardScaler class from scikit-learn.
from sklearn.preprocessing import StandardScaler
Standardize numeric columns
pens_clean_features = pens_clean.select_dtypes("number").copy()
Xpens = StandardScaler().fit_transform(pens_clean_features)
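To make the standardization step concrete, here is a minimal sketch on a small made-up array (`X_toy` is hypothetical and is not the penguins data). It checks that `StandardScaler` gives the same result as subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales.
X_toy = np.array([[1.0, 1000.0],
                  [2.0, 3000.0],
                  [3.0, 2000.0],
                  [4.0, 6000.0]])

X_std = StandardScaler().fit_transform(X_toy)

# The same calculation by hand: subtract the column mean and divide by the
# column standard deviation (population form, ddof=0, which StandardScaler uses).
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
print(np.allclose(X_std, X_manual))
```

After standardizing, every column has the same spread, so no single column dominates just because of its units.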
We can use the PCA class from scikit-learn to execute the TRANSFORMATION!
The transformation produces NEW variables that ACCOUNT for the relationship between ALL of the original numeric variables!
from sklearn.decomposition import PCA
PCA follows the same workflow as StandardScaler. We must:
- INITIALIZE the object based on ASSUMPTIONS
- FIT the object
- TRANSFORM a data set using the FITTED object
The main assumption we need for PCA is the NUMBER OF COMPONENTS or the number of NEWLY CREATED VARIABLES to produce.
We will NOT discuss how to decide the BEST number of new vars today. Instead, we will just focus on 2 because we will VISUALIZE 2 numeric variables via scatter plots!!
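The three steps listed above can be written out one at a time. This is a minimal sketch on random data (`X_demo` is a hypothetical stand-in for the standardized penguins array), and it also checks that the separate steps agree with the combined `fit_transform()` call.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))    # hypothetical stand-in for Xpens

pca_obj = PCA(n_components=2)        # 1. INITIALIZE with the number of components
pca_obj.fit(X_demo)                  # 2. FIT the object to the data
scores = pca_obj.transform(X_demo)   # 3. TRANSFORM the data with the fitted object

# The combined call does the same work in one step.
scores_one_step = PCA(n_components=2).fit_transform(X_demo)

print(scores.shape)
print(np.allclose(scores, scores_one_step))
```

Keeping the fitted object around (`pca_obj` here) is useful when you want to inspect it later or apply the same transformation to other data.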
Apply PCA in 1 line of code by INITIALIZING then FITTING and TRANSFORMING!!!
pca_pens = PCA(n_components=2).fit_transform(Xpens)
type(pca_pens)
numpy.ndarray
pca_pens
array([[-1.85359302e+00, 3.20693765e-02],
[-1.31625406e+00, -4.43526765e-01],
[-1.37660509e+00, -1.61230478e-01],
[-1.88528838e+00, -1.23512351e-02],
[-1.91998074e+00, 8.17598126e-01],
[-1.77302031e+00, -3.66222957e-01],
[-8.18496250e-01, 5.01243084e-01],
[-1.79895773e+00, -2.45393945e-01],
[-1.95614892e+00, 9.98282895e-01],
[-1.56952316e+00, 5.78081948e-01],
[-1.74800122e+00, -6.10244291e-01],
[-1.57577371e+00, 8.68357265e-02],
[-8.04720190e-01, 1.29355592e+00],
[-2.35017809e+00, -6.45191072e-01],
[-1.00498645e+00, 1.97242251e+00],
[-2.40824844e+00, -3.08968645e-01],
[-2.11369825e+00, -1.36493144e-01],
[-1.85705729e+00, -1.09144060e-01],
[-1.50501042e+00, -2.89127997e-01],
[-1.58113786e+00, -6.03932517e-01],
[-1.92846722e+00, -2.97394981e-01],
[-1.76295054e+00, 1.38259762e-01],
[-1.70361341e+00, -1.87802307e-01],
[-2.71417458e+00, -2.01106317e-01],
[-1.68232816e+00, 2.85542330e-01],
[-1.87994963e+00, -7.82580998e-01],
[-1.91081367e+00, -4.06695073e-01],
[-1.65683258e+00, -3.28286332e-01],
[-1.51840291e+00, 3.26408242e-01],
[-1.44646684e+00, -9.87685263e-01],
[-1.44062410e+00, 1.05909586e+00],
[-1.63466140e+00, 5.48223391e-01],
[-1.73335112e+00, 2.72394506e-01],
[-2.40765908e+00, 6.73451508e-02],
[-1.13764744e+00, 3.57809820e-01],
[-2.29657080e+00, -5.93801144e-01],
[-9.71848773e-01, 1.17509989e-01],
[-2.30890668e+00, -4.49404139e-01],
[-5.78401946e-01, 1.05458646e+00],
[-2.01067992e+00, -9.97271019e-01],
[-8.80262620e-01, 2.12079200e-01],
[-1.92925587e+00, 3.42881528e-01],
[-1.78298528e+00, -6.57410953e-01],
[-1.40940140e+00, 1.43826097e+00],
[-1.57392895e+00, -3.39592411e-01],
[-1.14654389e+00, 2.78170592e-01],
[-1.86608339e+00, -7.67327681e-01],
[-7.86733863e-01, 7.11147080e-01],
[-2.44789222e+00, -7.94851225e-01],
[-1.26418254e+00, 2.43767425e-01],
[-1.54901519e+00, -4.81769739e-01],
[-1.22044841e+00, 2.47154032e-01],
[-2.25876529e+00, -1.18962297e+00],
[-1.52359256e+00, 3.45359658e-02],
[-2.01615696e+00, -1.12589726e+00],
[-1.13641794e+00, 1.31328324e+00],
[-1.57091360e+00, -8.33767737e-01],
[-9.27431832e-01, 8.25272607e-02],
[-2.24489579e+00, -9.96917698e-01],
[-9.13660651e-01, 4.69928294e-02],
[-1.34180687e+00, -1.40816257e+00],
[-1.24076891e+00, 4.50049108e-01],
[-1.80093377e+00, -1.23282996e+00],
[-5.92025660e-01, 6.85886654e-01],
[-2.11142026e+00, -4.72533742e-01],
[-1.26934357e+00, -5.46639757e-03],
[-1.02609887e+00, -5.33157420e-01],
[-4.04478774e-01, 8.94152813e-01],
[-1.57243857e+00, -8.50558396e-01],
[-5.86662602e-01, 4.11120851e-01],
[-9.40429442e-01, -5.40033074e-01],
[-1.92733875e+00, 1.22172496e-01],
[-1.45634863e+00, -1.35600020e+00],
[-9.37516173e-01, 5.53350680e-01],
[-1.96939557e+00, -1.11892242e+00],
[-4.68328228e-02, 1.00901565e-01],
[-1.79183530e+00, -1.84002791e-01],
[-1.52578740e+00, -7.63992510e-02],
[-1.68181280e+00, -5.64107174e-01],
[-1.59639832e+00, 9.08101945e-01],
[-1.84348437e+00, 5.67099175e-02],
[-1.85729280e+00, -2.70705718e-01],
[-1.55507132e+00, 1.68921592e-01],
[-1.62210133e+00, 4.00341309e-02],
[-1.26523377e+00, -6.35421240e-01],
[-2.00393880e-01, 7.11886336e-02],
[-2.02709537e+00, -1.20799740e+00],
[-1.00562068e+00, -8.72793004e-02],
[-1.87080090e+00, -8.93881281e-01],
[-2.64027702e-01, 3.63384244e-01],
[-1.57962369e+00, -1.19371375e-01],
[-6.84823504e-01, 1.46252964e-01],
[-2.52929467e+00, -1.76228161e+00],
[-7.79625987e-01, 4.39581247e-01],
[-1.59563939e+00, -7.40347063e-01],
[-3.86175400e-01, 8.69122059e-01],
[-1.80101966e+00, -1.27844482e+00],
[-1.51265850e+00, 4.66837670e-01],
[-2.00243546e+00, -2.13819033e-01],
[-1.85740516e+00, 1.61221992e-01],
[-8.48810852e-01, -6.22812685e-01],
[-1.71870377e+00, 4.77518186e-01],
[-1.98479380e+00, -8.20882689e-01],
[-2.13534661e-01, 7.08300149e-01],
[-7.38240082e-01, -9.54490503e-01],
[-6.44875034e-01, 1.47936161e+00],
[-1.48219852e+00, -3.54236566e-01],
[-7.39940610e-01, 7.53287890e-01],
[-1.70321099e+00, 9.15253811e-01],
[-6.32808164e-01, 3.02917228e-01],
[-1.84273238e+00, -7.89182569e-01],
[-1.60946727e+00, 5.72883733e-01],
[-1.73484802e+00, -1.06473097e+00],
[-1.62792299e+00, 1.74301453e-01],
[-1.95305685e+00, -9.48638014e-01],
[-1.66339271e+00, 3.06844796e-01],
[-1.82836539e+00, -5.65972121e-01],
[-6.70854593e-01, 2.24468852e-01],
[-1.88191000e+00, -1.59486466e+00],
[-8.76999276e-01, 3.49638746e-01],
[-1.56785175e+00, -4.87347294e-01],
[-6.19917518e-01, 1.92001813e-01],
[-1.60358507e+00, -6.89218352e-01],
[ 7.01808888e-02, 3.33984567e-01],
[-1.66069875e+00, -3.94507052e-01],
[-1.13411289e+00, 6.57034153e-01],
[-1.68043855e+00, -3.20534232e-01],
[-7.08387339e-01, -1.48385164e-01],
[-1.68833943e+00, -5.51677888e-01],
[-9.70355138e-01, -2.16004046e-01],
[-1.88183815e+00, -8.89082390e-01],
[-1.10935310e+00, 7.49111595e-01],
[-1.65603365e+00, -1.12119459e+00],
[-8.04934115e-01, -1.73395579e-01],
[-1.18214807e+00, -5.23204908e-01],
[-1.36523239e+00, -4.34095818e-01],
[-1.97590114e+00, -2.09674424e+00],
[-1.02176382e+00, -4.79069974e-01],
[-1.67693429e+00, -1.00189205e+00],
[-1.76540033e+00, 1.32217563e-02],
[-1.11219725e+00, 5.38438736e-02],
[-2.06481174e+00, -3.89108766e-01],
[-1.55660384e+00, -6.95834197e-01],
[-1.34524467e+00, -3.48806583e-01],
[-1.57336339e+00, -9.58805742e-01],
[-6.18303403e-01, 2.46934847e-01],
[-7.93836850e-01, 5.02297056e-01],
[-3.89368970e-01, 1.57456101e+00],
[-5.15027365e-01, 1.57096244e+00],
[-1.19537904e+00, 7.06041689e-01],
[-3.04312640e-01, 1.97658038e+00],
[-3.26614065e-01, 3.64192175e-01],
[-1.63592055e+00, 5.50237855e-01],
[-7.88452183e-02, 1.17721487e+00],
[-4.70293902e-01, 9.15308965e-01],
[-4.16818937e-01, 1.86122420e+00],
[-5.18914090e-01, 5.01742105e-01],
[-5.78352198e-01, 2.07263418e+00],
[-7.82308030e-01, 3.30433534e-01],
[ 3.69588535e-01, 1.24385074e+00],
[-7.12498702e-01, 1.18722739e-01],
[-5.94770968e-02, 1.68634410e+00],
[-8.34897001e-01, 1.75334376e+00],
[-1.34571116e-01, 1.74031931e+00],
[-1.06082686e+00, 7.69161630e-01],
[ 1.08599515e-01, 1.00737973e+00],
[-1.39779585e+00, -1.86348140e-01],
[-6.56046746e-01, 5.50241661e-01],
[-1.42052018e+00, -4.45944135e-01],
[-5.11234668e-01, 1.58926869e+00],
[-7.90299107e-01, 5.06500522e-01],
[ 9.04349588e-02, 1.61612776e+00],
[-3.01545221e-01, 1.13821856e+00],
[-2.32942707e-01, 1.30929055e+00],
[-6.86335466e-01, 4.69421229e-01],
[ 5.57174822e-01, 2.15032356e+00],
[-1.40654483e+00, -6.70221602e-01],
[ 1.75368656e-01, 2.60270659e+00],
[-1.19133191e+00, -4.39598103e-01],
[ 2.61046726e-01, 1.42295499e+00],
[-4.77965711e-01, 1.14822032e+00],
[ 7.44911095e-02, 2.07746781e-01],
[-4.20670546e-01, 8.19697344e-01],
[ 7.25638767e-01, 2.37167261e+00],
[-1.04370429e+00, -5.62049265e-02],
[ 6.01454563e-01, 2.18201887e+00],
[ 1.38759674e-01, 1.47518981e+00],
[-8.41124399e-01, 3.19554637e-01],
[-4.72686923e-01, 1.47823497e+00],
[-5.29414382e-01, 2.96136491e-02],
[-1.43693398e-01, 1.00422814e+00],
[ 4.62160556e-01, 1.31195693e+00],
[-6.45485399e-01, 8.87659748e-01],
[ 4.40184372e-01, 1.54979441e+00],
[-9.17707186e-01, 1.34996671e+00],
[-3.08991857e-02, 6.41199552e-01],
[-1.87582034e-01, 5.70474676e-02],
[ 6.87115867e-02, 1.53281144e+00],
[-6.28963576e-01, 1.81340228e-01],
[ 1.92827122e-02, 1.74964587e+00],
[-1.31309930e+00, -1.96650725e-01],
[-3.30925312e-01, 1.49055629e+00],
[-8.50169944e-01, -1.91170114e-01],
[-1.37643774e-01, 1.67674491e+00],
[-5.17501489e-02, 1.30607700e+00],
[-1.07351711e+00, 1.01394523e+00],
[ 2.14874699e-01, 1.79229394e+00],
[-5.05885138e-01, -1.85804289e-02],
[-4.51461631e-01, 6.54489015e-02],
[ 5.53474521e-01, 2.34761163e+00],
[-7.39913565e-01, 2.48154967e-01],
[-3.67889760e-01, 9.91079624e-01],
[ 4.92359602e-01, 1.48484928e+00],
[-2.13416837e-01, 1.26155380e+00],
[ 1.59356859e+00, -1.34179573e+00],
[ 2.89205390e+00, 4.64090012e-01],
[ 1.55157173e+00, -6.96759932e-01],
[ 2.62068561e+00, 1.37233188e-02],
[ 2.23455895e+00, -5.63287236e-01],
[ 1.55889027e+00, -1.17201378e+00],
[ 1.45637702e+00, -8.23329213e-01],
[ 2.02554963e+00, -3.55648737e-01],
[ 1.16950299e+00, -1.57891764e+00],
[ 1.81451188e+00, -3.10575391e-01],
[ 1.28618821e+00, -1.69540027e+00],
[ 2.16995116e+00, 2.53134960e-01],
[ 1.66843953e+00, -1.18978332e+00],
[ 2.50595964e+00, -3.92893307e-01],
[ 1.03819687e+00, -8.36838135e-01],
[ 2.52237724e+00, 1.53089665e-01],
[ 9.11480748e-01, -1.70468040e+00],
[ 3.08806126e+00, -1.59072567e-02],
[ 1.46071532e+00, -7.76714255e-01],
[ 2.45853762e+00, -2.01291445e-01],
[ 2.81995631e+00, -3.28714403e-01],
[ 1.75334566e+00, -8.76120401e-01],
[ 1.37704625e+00, -7.80126191e-01],
[ 1.62341756e+00, -2.13079172e-01],
[ 1.85465371e+00, -1.68481442e+00],
[ 1.78304339e+00, -5.13745958e-01],
[ 2.32062326e+00, -3.15071902e-01],
[ 1.57198405e+00, -6.56470332e-01],
[ 2.58027529e+00, 4.07762389e-02],
[ 2.23324413e+00, -2.83702741e-01],
[ 1.17069843e+00, -1.28141516e+00],
[ 1.45779016e+00, -8.74674010e-01],
[ 3.78701834e+00, 1.83601539e+00],
[ 2.33349180e+00, -2.98646309e-01],
[ 2.14182216e+00, 2.55556267e-01],
[ 1.59133864e+00, -1.48042442e+00],
[ 1.46271619e+00, 2.06122550e-01],
[ 1.11168166e+00, -1.42616223e+00],
[ 1.75972697e+00, 3.58655736e-02],
[ 7.09891542e-01, -1.56660409e+00],
[ 2.71361148e+00, 2.96581645e-01],
[ 1.24766590e+00, -1.24670723e+00],
[ 1.89611421e+00, -2.02401218e-01],
[ 2.58249171e+00, 3.39509176e-01],
[ 1.76453362e+00, -1.29262601e+00],
[ 1.15532939e+00, -1.15325176e+00],
[ 2.60359333e+00, 3.26484463e-01],
[ 1.96619306e+00, -1.37531536e+00],
[ 1.70292714e+00, -3.10211734e-01],
[ 1.63023914e+00, -8.49052487e-01],
[ 2.52824538e+00, -6.33769450e-01],
[ 1.15735133e+00, -9.75741632e-01],
[ 2.47953715e+00, -1.19944640e-01],
[ 1.90404533e+00, -7.71411353e-01],
[ 1.80265515e+00, -5.15867852e-01],
[ 9.99994848e-01, -1.33142706e+00],
[ 1.89119896e+00, -6.27629575e-01],
[ 9.30919099e-01, -1.14016421e+00],
[ 2.77838403e+00, 8.63973188e-02],
[ 1.07656958e+00, -1.21655353e+00],
[ 2.21598059e+00, -5.62234490e-01],
[ 1.47355252e+00, -1.11059336e+00],
[ 3.37817705e+00, 6.89442994e-01],
[ 1.83216652e+00, -9.47529000e-01],
[ 2.77396146e+00, 6.44562815e-01],
[ 2.89794904e+00, 3.77737157e-01],
[ 1.68225824e+00, -1.19992388e+00],
[ 2.82297979e+00, -2.51494909e-03],
[ 1.73822780e+00, -4.11243001e-01],
[ 1.88543725e+00, -2.85343544e-01],
[ 2.10338085e+00, -7.79830974e-02],
[ 2.02796808e+00, -5.80915426e-01],
[ 1.59601675e+00, -5.58889916e-01],
[ 2.90496724e+00, 1.98243239e-01],
[ 1.49289256e+00, -7.74316869e-01],
[ 2.77638908e+00, 6.09393449e-01],
[ 1.73279991e+00, -1.17234318e+00],
[ 2.35528428e+00, -2.13839444e-03],
[ 1.70570972e+00, -4.73358039e-01],
[ 2.69998725e+00, 4.27945010e-01],
[ 1.61251537e+00, -6.10214911e-01],
[ 2.48664340e+00, 2.66357334e-01],
[ 1.58421835e+00, -1.20655899e+00],
[ 2.60478500e+00, 9.46598161e-01],
[ 1.48255754e+00, -1.14027062e+00],
[ 2.65819078e+00, -2.86338576e-01],
[ 1.84514307e+00, -8.27905113e-01],
[ 2.82194749e+00, 9.64088245e-01],
[ 1.94077693e+00, -4.13378481e-01],
[ 2.62497748e+00, 1.00047845e+00],
[ 1.49201527e+00, -8.57170341e-01],
[ 2.60960622e+00, 3.20912437e-01],
[ 1.51912978e+00, -8.75767080e-01],
[ 2.57359526e+00, 2.59869947e-01],
[ 1.83678033e+00, 1.16188362e-01],
[ 2.08569057e+00, -6.46771800e-01],
[ 1.29687923e+00, -5.94513353e-01],
[ 2.42913431e+00, 6.21116375e-01],
[ 1.99672542e+00, -3.12558491e-01],
[ 3.08946906e+00, 1.38569979e+00],
[ 1.70781430e+00, -2.42760558e-01],
[ 2.86192618e+00, -1.81068897e-01],
[ 1.91173444e+00, 6.14939272e-03],
[ 1.01903506e+00, -1.19945381e+00],
[ 2.68593517e+00, 6.11780499e-01],
[ 1.12616049e+00, -1.31974077e+00],
[ 1.97540337e+00, -2.58352741e-01],
[ 2.10123089e+00, 1.28213706e-03],
[ 3.08631267e+00, 3.03503990e-01],
[ 1.15660747e+00, -8.02661928e-01],
[ 2.87996707e+00, 6.09944432e-01],
[ 1.58107282e+00, -9.75789324e-01],
[ 3.47928847e+00, 9.17457141e-01],
[ 2.68799274e+00, 3.16920939e-01],
[ 1.99771558e+00, -9.76771459e-01],
[ 1.83265107e+00, -7.84509926e-01],
[ 2.75150503e+00, 2.66555715e-01],
[ 1.71385366e+00, -7.25875158e-01],
[ 2.01853683e+00, 3.36553720e-01]])
pca_pens.shape
(333, 2)
Xpens.shape
(333, 4)
pens_clean_features.shape
(333, 4)
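The shapes above show four columns going in and two coming out. Each new column is a weighted sum of the (centered) input columns, with the weights stored in the fitted object's `components_` attribute. The sketch below uses hypothetical random data rather than the penguins array and reproduces the transformed values by hand.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
raw = rng.normal(size=(100, 4))      # hypothetical data with 4 numeric columns
raw[:, 1] += 0.8 * raw[:, 0]         # make two of the columns correlated
X_demo = StandardScaler().fit_transform(raw)

pca_obj = PCA(n_components=2).fit(X_demo)
scores = pca_obj.transform(X_demo)

# One row of weights per new variable, one weight per original column.
weights = pca_obj.components_
print(weights.shape)

# Each new variable is a weighted sum of the centered original columns.
scores_manual = (X_demo - pca_obj.mean_) @ weights.T
print(np.allclose(scores, scores_manual))
```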
Convert the NumPy array pca_pens into a DataFrame to support visualization.
Name the columns pc01 and pc02.
pca_pens_df = pd.DataFrame(pca_pens, columns=["pc01", "pc02"])
pca_pens_df
| | pc01 | pc02 |
|---|---|---|
| 0 | -1.853593 | 0.032069 |
| 1 | -1.316254 | -0.443527 |
| 2 | -1.376605 | -0.161230 |
| 3 | -1.885288 | -0.012351 |
| 4 | -1.919981 | 0.817598 |
| ... | ... | ... |
| 328 | 1.997716 | -0.976771 |
| 329 | 1.832651 | -0.784510 |
| 330 | 2.751505 | 0.266556 |
| 331 | 1.713854 | -0.725875 |
| 332 | 2.018537 | 0.336554 |
333 rows × 2 columns
Visualize the relationship between these two NEWLY created variables as a scatter plot.
sns.relplot(data=pca_pens_df, x="pc01", y="pc02")
plt.show()
Let's calculate the CORRELATION MATRIX between these two new variables!
fig, ax = plt.subplots()
sns.heatmap(pca_pens_df.corr(),
vmin=-1,
vmax=1,
center=0,
fmt=".3f",
cmap="coolwarm",
cbar=False,
annot=True,
annot_kws={"fontsize": 15},
ax=ax)
plt.show()
sns.lmplot(data=pca_pens_df, x="pc01", y="pc02")
plt.show()
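PCA scores are uncorrelated with each other by construction when they are computed on the same data the PCA was fit to. The sketch below checks this numerically on hypothetical synthetic data (the columns `a`, `b`, `c` are made up for illustration), where the inputs are strongly correlated but the two new variables are not.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
x = rng.normal(size=200)

# Hypothetical columns that all depend on the same underlying variable x.
demo = pd.DataFrame({
    "a": x,
    "b": 0.9 * x + 0.3 * rng.normal(size=200),
    "c": -0.7 * x + 0.5 * rng.normal(size=200),
})

scores = pd.DataFrame(PCA(n_components=2).fit_transform(demo),
                      columns=["pc01", "pc02"])

print(demo.corr().round(3))     # large off-diagonal correlations
print(scores.corr().round(3))   # off-diagonal entries are essentially zero
```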
But we can include GROUPING variables with our PCA!!!
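One caution when copying a column from a filtered DataFrame into a newly built one: pandas aligns a Series on index LABELS, not on row position. After `dropna()` the index has gaps, so a label-based assignment can leave missing values or attach a value to the wrong row. The sketch below shows the pitfall on a tiny made-up DataFrame (all names here are hypothetical).

```python
import numpy as np
import pandas as pd

full = pd.DataFrame({"group": ["a", "b", "c", "d"],
                     "value": [1.0, np.nan, 3.0, 4.0]})
clean = full.dropna().copy()    # remaining index labels: 0, 2, 3

# A new DataFrame built from an array gets a fresh index: 0, 1, 2.
new_df = pd.DataFrame({"score": [10.0, 20.0, 30.0]})

# Label-based alignment: label 1 has no match, and label 2 receives "c"
# even though "c" is the SECOND row of clean.
new_df["group_by_label"] = clean["group"]

# Position-based assignment: drop the index by converting to a NumPy array.
new_df["group_by_position"] = clean["group"].to_numpy()

print(new_df)
```

Using `.to_numpy()` (or `.reset_index(drop=True)`) keeps the rows in the same order as the array the new DataFrame was built from.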
# Assign by position: pens_clean keeps its original index labels (0 to 343, with gaps
# left by dropna), while pca_pens_df has a fresh RangeIndex. A plain Series assignment
# would align on labels, leaving some rows missing and others mismatched.
pca_pens_df["species"] = pens_clean.species.to_numpy()
sns.lmplot(data=pca_pens_df, x="pc01", y="pc02", hue="species")
plt.show()
We saw earlier that it was easy to SEPARATE the penguins data into 2 clusters!
The NEWLY CREATED variables from PCA EASILY identify the 2 PRIMARY GROUPS in the data!
Clustering and PCA
Instead of visualizing the clustering results on the original variables, let's visualize the CLUSTERING results with the NEWLY created PCA variables!
from sklearn.cluster import KMeans
clusters_2 = KMeans(n_clusters=2, random_state=121, n_init=25, max_iter=500).fit_predict(Xpens)
pca_pens_df['k2'] = pd.Series(clusters_2, index=pca_pens_df.index).astype("category")
pca_pens_df
| | pc01 | pc02 | species | k2 |
|---|---|---|---|---|
| 0 | -1.853593 | 0.032069 | Adelie | 0 |
| 1 | -1.316254 | -0.443527 | Adelie | 0 |
| 2 | -1.376605 | -0.161230 | Adelie | 0 |
| 3 | -1.885288 | -0.012351 | Adelie | 0 |
| 4 | -1.919981 | 0.817598 | Adelie | 0 |
| ... | ... | ... | ... | ... |
| 328 | 1.997716 | -0.976771 | Gentoo | 1 |
| 329 | 1.832651 | -0.784510 | Gentoo | 1 |
| 330 | 2.751505 | 0.266556 | Gentoo | 1 |
| 331 | 1.713854 | -0.725875 | Gentoo | 1 |
| 332 | 2.018537 | 0.336554 | Gentoo | 1 |
333 rows × 4 columns
pca_pens_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   pc01     333 non-null    float64
 1   pc02     333 non-null    float64
 2   species  333 non-null    object
 3   k2       333 non-null    category
dtypes: category(1), float64(2), object(1)
memory usage: 8.4+ KB
sns.relplot(data=pca_pens_df, x="pc01", y="pc02", hue="k2")
plt.show()
Larger example
On Canvas, there is a WINE DATA SET. Here we read the same data directly from the UCI Machine Learning Repository.
wine_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
wine_names = ['Cultivar', 'Alcohol', 'Malic_acid', 'Ash', 'Alcalinity_of_ash', 'Magnesium', 'Total_phenols',
'Flavanoids', 'Nonflavanoid_phenols', 'Proanthocyanin', 'Color_intensity', 'Hue', 'OD280_OD315', 'Proline']
wine_data = pd.read_csv(wine_url, names=wine_names)
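If the URL is unreachable, scikit-learn bundles a copy of the UCI wine data that can be loaded without a network connection. This is a sketch of that alternative, not the approach used in this notebook: the bundled version uses different column names, and it stores the cultivar in a column named `target` coded 0, 1, 2 instead of 1, 2, 3.

```python
from sklearn.datasets import load_wine

# as_frame=True returns pandas objects instead of NumPy arrays.
wine_bunch = load_wine(as_frame=True)

# .frame holds the 13 numeric features plus the target column.
wine_alt = wine_bunch.frame

print(wine_alt.shape)
print(wine_alt["target"].value_counts())
```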
wine_data
| | Cultivar | Alcohol | Malic_acid | Ash | Alcalinity_of_ash | Magnesium | Total_phenols | Flavanoids | Nonflavanoid_phenols | Proanthocyanin | Color_intensity | Hue | OD280_OD315 | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 1 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 1 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 1 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 3 | 13.71 | 5.65 | 2.45 | 20.5 | 95 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740 |
| 174 | 3 | 13.40 | 3.91 | 2.48 | 23.0 | 102 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750 |
| 175 | 3 | 13.27 | 4.28 | 2.26 | 20.0 | 120 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835 |
| 176 | 3 | 13.17 | 2.59 | 2.37 | 20.0 | 120 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840 |
| 177 | 3 | 14.13 | 4.10 | 2.74 | 24.5 | 96 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560 |
178 rows × 14 columns
wine_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Cultivar              178 non-null    int64
 1   Alcohol               178 non-null    float64
 2   Malic_acid            178 non-null    float64
 3   Ash                   178 non-null    float64
 4   Alcalinity_of_ash     178 non-null    float64
 5   Magnesium             178 non-null    int64
 6   Total_phenols         178 non-null    float64
 7   Flavanoids            178 non-null    float64
 8   Nonflavanoid_phenols  178 non-null    float64
 9   Proanthocyanin        178 non-null    float64
 10  Color_intensity       178 non-null    float64
 11  Hue                   178 non-null    float64
 12  OD280_OD315           178 non-null    float64
 13  Proline               178 non-null    int64
dtypes: float64(11), int64(3)
memory usage: 19.6 KB
wine_data.isna().sum()
Cultivar                0
Alcohol                 0
Malic_acid              0
Ash                     0
Alcalinity_of_ash       0
Magnesium               0
Total_phenols           0
Flavanoids              0
Nonflavanoid_phenols    0
Proanthocyanin          0
Color_intensity         0
Hue                     0
OD280_OD315             0
Proline                 0
dtype: int64
wine_data.Cultivar.value_counts()
Cultivar
2    71
1    59
3    48
Name: count, dtype: int64
Convert Cultivar to a categorical variable.
wine_data["Cultivar"] = wine_data.Cultivar.astype("category")
wine_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Cultivar              178 non-null    category
 1   Alcohol               178 non-null    float64
 2   Malic_acid            178 non-null    float64
 3   Ash                   178 non-null    float64
 4   Alcalinity_of_ash     178 non-null    float64
 5   Magnesium             178 non-null    int64
 6   Total_phenols         178 non-null    float64
 7   Flavanoids            178 non-null    float64
 8   Nonflavanoid_phenols  178 non-null    float64
 9   Proanthocyanin        178 non-null    float64
 10  Color_intensity       178 non-null    float64
 11  Hue                   178 non-null    float64
 12  OD280_OD315           178 non-null    float64
 13  Proline               178 non-null    int64
dtypes: category(1), float64(11), int64(2)
memory usage: 18.5 KB
Why will PCA help here?
We could make a PAIRS PLOT between all 13 numeric columns...
sns.pairplot(data=wine_data,
hue="Cultivar",
diag_kws={"common_norm":False})
plt.show()
PCA allows us to EXPLOIT relationships between columns!
We can check whether such relationships exist by creating CORRELATION PLOTS!
fig, ax = plt.subplots()
sns.heatmap(data=wine_data.corr(numeric_only=True),
vmin=-1,
vmax=1,
center=0,
cbar=False,
cmap="coolwarm",
annot=True,
annot_kws={"fontsize":7},
ax=ax)
plt.show()
BEFORE we execute PCA, we need to check the MAGNITUDE and SCALES!!!
sns.catplot(data=wine_data, kind="box", aspect=2)
plt.show()
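To see why the scales matter, the sketch below compares the weights of the first principal component with and without standardizing. It assumes scikit-learn's bundled copy of the wine data (`load_wine`, lowercase column names) rather than the `wine_data` object used in this notebook. Without standardizing, the first new variable is almost entirely the column with the largest spread.

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn copy stands in for the UCI file.
X_raw = load_wine(as_frame=True).data

# Standard deviations differ by orders of magnitude across columns.
print(X_raw.std().sort_values(ascending=False).round(2))

# Weights of the first component, fit on raw and on standardized data.
pc1_raw = PCA(n_components=1).fit(X_raw).components_[0]
pc1_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X_raw)).components_[0]

weights = pd.DataFrame({"raw": pc1_raw, "standardized": pc1_std},
                       index=X_raw.columns)
print(weights.round(3))
```

In the `raw` column nearly all of the weight sits on `proline`, while the `standardized` column spreads the weight across many variables.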
Preprocess or STANDARDIZE the data
wine_data_features = wine_data.select_dtypes("number").copy()
wine_data_features
| | Alcohol | Malic_acid | Ash | Alcalinity_of_ash | Magnesium | Total_phenols | Flavanoids | Nonflavanoid_phenols | Proanthocyanin | Color_intensity | Hue | OD280_OD315 | Proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185 |
| 3 | 14.37 | 1.95 | 2.50 | 16.8 | 113 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 173 | 13.71 | 5.65 | 2.45 | 20.5 | 95 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740 |
| 174 | 13.40 | 3.91 | 2.48 | 23.0 | 102 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750 |
| 175 | 13.27 | 4.28 | 2.26 | 20.0 | 120 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835 |
| 176 | 13.17 | 2.59 | 2.37 | 20.0 | 120 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840 |
| 177 | 14.13 | 4.10 | 2.74 | 24.5 | 96 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560 |
178 rows × 13 columns
Xwine = StandardScaler().fit_transform(wine_data_features)
Xwine
array([[ 1.51861254, -0.5622498 , 0.23205254, ..., 0.36217728,
1.84791957, 1.01300893],
[ 0.24628963, -0.49941338, -0.82799632, ..., 0.40605066,
1.1134493 , 0.96524152],
[ 0.19687903, 0.02123125, 1.10933436, ..., 0.31830389,
0.78858745, 1.39514818],
...,
[ 0.33275817, 1.74474449, -0.38935541, ..., -1.61212515,
-1.48544548, 0.28057537],
[ 0.20923168, 0.22769377, 0.01273209, ..., -1.56825176,
-1.40069891, 0.29649784],
[ 1.39508604, 1.58316512, 1.36520822, ..., -1.52437837,
-1.42894777, -0.59516041]])
sns.catplot(data=pd.DataFrame(Xwine, columns=wine_data_features.columns), kind="box", aspect=3)
plt.show()
Execute PCA and return 2 newly created variables!!
pca_wine = PCA(n_components=2).fit_transform(Xwine)
pca_wine
array([[ 3.31675081, -1.44346263],
[ 2.20946492, 0.33339289],
[ 2.51674015, -1.0311513 ],
[ 3.75706561, -2.75637191],
[ 1.00890849, -0.86983082],
[ 3.05025392, -2.12240111],
[ 2.44908967, -1.17485013],
[ 2.05943687, -1.60896307],
[ 2.5108743 , -0.91807096],
[ 2.75362819, -0.78943767],
[ 3.47973668, -1.30233324],
[ 1.7547529 , -0.61197723],
[ 2.11346234, -0.67570634],
[ 3.45815682, -1.13062988],
[ 4.31278391, -2.09597558],
[ 2.3051882 , -1.66255173],
[ 2.17195527, -2.32730534],
[ 1.89897118, -1.63136888],
[ 3.54198508, -2.51834367],
[ 2.0845222 , -1.06113799],
[ 3.12440254, -0.78689711],
[ 1.08657007, -0.24174355],
[ 2.53522408, 0.09184062],
[ 1.64498834, 0.51627893],
[ 1.76157587, 0.31714893],
[ 0.9900791 , -0.94066734],
[ 1.77527763, -0.68617513],
[ 1.23542396, 0.08980704],
[ 2.18840633, -0.68956962],
[ 2.25610898, -0.19146194],
[ 2.50022003, -1.24083383],
[ 2.67741105, -1.47187365],
[ 1.62857912, -0.05270445],
[ 1.90269086, -1.63306043],
[ 1.41038853, -0.69793432],
[ 1.90382623, -0.17671095],
[ 1.38486223, -0.65863985],
[ 1.12220741, -0.11410976],
[ 1.5021945 , 0.76943201],
[ 2.52980109, -1.80300198],
[ 2.58809543, -0.7796163 ],
[ 0.66848199, -0.16996094],
[ 3.07080699, -1.15591896],
[ 0.46220914, -0.33074213],
[ 2.10135193, 0.07100892],
[ 1.13616618, -1.77710739],
[ 2.72660096, -1.19133469],
[ 2.82133927, -0.6462586 ],
[ 2.00985085, -1.24702946],
[ 2.7074913 , -1.75196741],
[ 3.21491747, -0.16699199],
[ 2.85895983, -0.7452788 ],
[ 3.50560436, -1.61273386],
[ 2.22479138, -1.875168 ],
[ 2.14698782, -1.01675154],
[ 2.46932948, -1.32900831],
[ 2.74151791, -1.43654878],
[ 2.17374092, -1.21219984],
[ 3.13938015, -1.73157912],
[-0.92858197, 3.07348616],
[-1.54248014, 1.38144351],
[-1.83624976, 0.82998412],
[ 0.03060683, 1.26278614],
[ 2.05026161, 1.9250326 ],
[-0.60968083, 1.90805881],
[ 0.90022784, 0.76391147],
[ 2.24850719, 1.88459248],
[ 0.18338403, 2.42714611],
[-0.81280503, 0.22051399],
[ 1.9756205 , 1.40328323],
[-1.57221622, 0.88498314],
[ 1.65768181, 0.9567122 ],
[-0.72537239, 1.0636454 ],
[ 2.56222717, -0.26019855],
[ 1.83256757, 1.2878782 ],
[-0.8679929 , 2.44410119],
[ 0.3700144 , 2.15390698],
[-1.45737704, 1.38335177],
[ 1.26293085, 0.77084953],
[ 0.37615037, 1.0270434 ],
[ 0.7620639 , 3.37505381],
[ 1.03457797, 1.45070974],
[-0.49487676, 2.38124353],
[-2.53897708, 0.08744336],
[ 0.83532015, 1.47367055],
[ 0.78790461, 2.02662652],
[-0.80683216, 2.23383039],
[-0.55804262, 2.37298543],
[-1.11511104, 1.80224719],
[-0.55572283, 2.65754004],
[-1.34928528, 2.11800147],
[-1.56448261, 1.85221452],
[-1.93255561, 1.55949546],
[ 0.74666594, 2.31293171],
[ 0.95745536, 2.22352843],
[ 2.54386518, -0.16927402],
[-0.54395259, 0.36892655],
[ 1.03104975, 2.56556935],
[ 2.25190942, 1.43274138],
[ 1.41021602, 2.16619177],
[ 0.79771979, 2.3769488 ],
[-0.54953173, 2.29312864],
[-0.16117374, 1.16448332],
[-0.65979494, 2.67996119],
[ 0.39235441, 2.09873171],
[-1.77249908, 1.71728847],
[-0.36626736, 2.1693533 ],
[-1.62067257, 1.35558339],
[ 0.08253578, 2.30623459],
[ 1.57827507, 1.46203429],
[ 1.42056925, 1.41820664],
[-0.27870275, 1.93056809],
[-1.30314497, 0.76317231],
[-0.45707187, 2.26941561],
[-0.49418585, 1.93904505],
[ 0.48207441, 3.87178385],
[-0.25288888, 2.82149237],
[-0.10722764, 1.92892204],
[-2.4330126 , 1.25714104],
[-0.55108954, 2.22216155],
[ 0.73962193, 1.40895667],
[ 1.33632173, -0.25333693],
[-1.177087 , 0.66396684],
[-0.46233501, 0.61828818],
[ 0.97847408, 1.4455705 ],
[-0.09680973, 2.10999799],
[ 0.03848715, 1.26676211],
[-1.5971585 , 1.20814357],
[-0.47956492, 1.93884066],
[-1.79283347, 1.1502881 ],
[-1.32710166, -0.17038923],
[-2.38450083, -0.37458261],
[-2.9369401 , -0.26386183],
[-2.14681113, -0.36825495],
[-2.36986949, 0.45963481],
[-3.06384157, -0.35341284],
[-3.91575378, -0.15458252],
[-3.93646339, -0.65968723],
[-3.09427612, -0.34884276],
[-2.37447163, -0.29198035],
[-2.77881295, -0.28680487],
[-2.28656128, -0.37250784],
[-2.98563349, -0.48921791],
[-2.3751947 , -0.48233372],
[-2.20986553, -1.1600525 ],
[-2.625621 , -0.56316076],
[-4.28063878, -0.64967096],
[-3.58264137, -1.27270275],
[-2.80706372, -1.57053379],
[-2.89965933, -2.04105701],
[-2.32073698, -2.35636608],
[-2.54983095, -2.04528309],
[-1.81254128, -1.52764595],
[-2.76014464, -2.13893235],
[-2.7371505 , -0.40988627],
[-3.60486887, -1.80238422],
[-2.889826 , -1.92521861],
[-3.39215608, -1.31187639],
[-1.0481819 , -3.51508969],
[-1.60991228, -2.40663816],
[-3.14313097, -0.73816104],
[-2.2401569 , -1.17546529],
[-2.84767378, -0.55604397],
[-2.59749706, -0.69796554],
[-2.94929937, -1.55530896],
[-3.53003227, -0.8825268 ],
[-2.40611054, -2.59235618],
[-2.92908473, -1.27444695],
[-2.18141278, -2.07753731],
[-2.38092779, -2.58866743],
[-3.21161722, 0.2512491 ],
[-3.67791872, -0.84774784],
[-2.4655558 , -2.1937983 ],
[-3.37052415, -2.21628914],
[-2.60195585, -1.75722935],
[-2.67783946, -2.76089913],
[-2.38701709, -2.29734668],
[-3.20875816, -2.76891957]])
pca_wine_df = pd.DataFrame(pca_wine, columns=["pc01", "pc02"])
pca_wine_df
| | pc01 | pc02 |
|---|---|---|
| 0 | 3.316751 | -1.443463 |
| 1 | 2.209465 | 0.333393 |
| 2 | 2.516740 | -1.031151 |
| 3 | 3.757066 | -2.756372 |
| 4 | 1.008908 | -0.869831 |
| ... | ... | ... |
| 173 | -3.370524 | -2.216289 |
| 174 | -2.601956 | -1.757229 |
| 175 | -2.677839 | -2.760899 |
| 176 | -2.387017 | -2.297347 |
| 177 | -3.208758 | -2.768920 |
178 rows × 2 columns
sns.relplot(data=pca_wine_df, x="pc01", y="pc02")
plt.show()
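Standardizing and PCA can also be chained into a single object with scikit-learn's `make_pipeline`, so the two steps always run together and in the right order. This is an optional sketch, not what the notebook does above, and it assumes the bundled `load_wine` copy of the data.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn copy stands in for the UCI file.
X_raw = load_wine(as_frame=True).data

# One object that standardizes first and then applies PCA.
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
scores_pipe = pipe.fit_transform(X_raw)

# The same result from the two separate steps.
scores_two_step = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(X_raw)
)

print(scores_pipe.shape)
print(np.allclose(scores_pipe, scores_two_step))
```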
Let's add the KNOWN groupings.
pca_wine_df["Cultivar"] = wine_data.Cultivar
pca_wine_df
| | pc01 | pc02 | Cultivar |
|---|---|---|---|
| 0 | 3.316751 | -1.443463 | 1 |
| 1 | 2.209465 | 0.333393 | 1 |
| 2 | 2.516740 | -1.031151 | 1 |
| 3 | 3.757066 | -2.756372 | 1 |
| 4 | 1.008908 | -0.869831 | 1 |
| ... | ... | ... | ... |
| 173 | -3.370524 | -2.216289 | 3 |
| 174 | -2.601956 | -1.757229 | 3 |
| 175 | -2.677839 | -2.760899 | 3 |
| 176 | -2.387017 | -2.297347 | 3 |
| 177 | -3.208758 | -2.768920 | 3 |
178 rows × 3 columns
sns.relplot(data=pca_wine_df, x="pc01", y="pc02", hue="Cultivar")
plt.show()
Run KMeans with 3 clusters and visualize the 3 cluster labels using the NEWLY CREATED PCA variables!!!
clusters_3 = KMeans(n_clusters=3, random_state=121, n_init=25, max_iter=500).fit_predict(Xwine)
clusters_3
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 0, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0], dtype=int32)
pca_wine_df["k3"] = pd.Series(clusters_3, index=pca_wine_df.index).astype("category")
pca_wine_df
| | pc01 | pc02 | Cultivar | k3 |
|---|---|---|---|---|
| 0 | 3.316751 | -1.443463 | 1 | 1 |
| 1 | 2.209465 | 0.333393 | 1 | 1 |
| 2 | 2.516740 | -1.031151 | 1 | 1 |
| 3 | 3.757066 | -2.756372 | 1 | 1 |
| 4 | 1.008908 | -0.869831 | 1 | 1 |
| ... | ... | ... | ... | ... |
| 173 | -3.370524 | -2.216289 | 3 | 0 |
| 174 | -2.601956 | -1.757229 | 3 | 0 |
| 175 | -2.677839 | -2.760899 | 3 | 0 |
| 176 | -2.387017 | -2.297347 | 3 | 0 |
| 177 | -3.208758 | -2.768920 | 3 | 0 |
178 rows × 4 columns
sns.relplot(data=pca_wine_df, x="pc01", y="pc02", hue="k3", style="Cultivar")
plt.show()
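The scatter plot compares the cluster labels to the known groups visually. A cross-tabulation can make the same comparison with counts. The sketch below is only an illustration: it ASSUMES scikit-learn's bundled wine data as a stand-in for `wine_data`, with its `target` column playing the role of `Cultivar`, and it ASSUMES the features are standardized before clustering. The names `X_demo`, `labels_demo`, and `counts_demo` are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: scikit-learn's bundled wine data stands in for `wine_data`;
# its `target` column (0, 1, 2) plays the role of `Cultivar`.
wine = load_wine(as_frame=True)
X_demo = StandardScaler().fit_transform(wine.data)

# Same KMeans settings as the cell above.
labels_demo = KMeans(n_clusters=3, random_state=121, n_init=25, max_iter=500).fit_predict(X_demo)

# Rows are the known groups, columns are the KMeans cluster labels.
counts_demo = pd.crosstab(wine.target, pd.Series(labels_demo, name="k3"))
print(counts_demo)
```

The cluster numbers are arbitrary labels, so a known group does not have to share its number with the cluster that captures it. What matters is whether each row concentrates in a single column.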
A really big example¶
Use the Sonar data!
sonar_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data'
sonar_df = pd.read_csv( sonar_url, header=None )
sonar_df.shape
(208, 61)
Convert the integer column names to zero-padded strings.
sonar_df.columns = ["X%02d" % d for d in sonar_df.columns]
sonar_df.columns
Index(['X00', 'X01', 'X02', 'X03', 'X04', 'X05', 'X06', 'X07', 'X08', 'X09',
'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19',
'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29',
'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39',
'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49',
'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59',
'X60'],
dtype='object')
sonar_df.nunique()
X00 177
X01 182
X02 190
X03 181
X04 193
...
X56 121
X57 124
X58 119
X59 109
X60 2
Length: 61, dtype: int64
sonar_df.X60.value_counts()
X60
M    111
R     97
Name: count, dtype: int64
sonar_df.isna().sum().max()
0
Let's look at the correlation structure between ALL numeric columns!
fig, ax = plt.subplots()
sns.heatmap(sonar_df.corr(numeric_only=True),
vmin=-1,
vmax=1,
center=0,
cmap="coolwarm",
ax=ax)
plt.show()
Even though there are 60 numeric columns, many of the variables are HIGHLY CORRELATED!!
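With this many columns the heatmap is hard to read cell by cell, so it can help to list the strongly correlated pairs directly. The sketch below is a minimal illustration on synthetic data, NOT the sonar data: the frame `demo_df`, its columns, and the 0.8 cutoff are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# ASSUMPTION: synthetic data stands in for the sonar features.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
demo_df = pd.DataFrame({
    "a": base[:, 0],
    "b": base[:, 0] + 0.1 * rng.normal(size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                      # unrelated to "a" and "b"
})

corr = demo_df.corr(numeric_only=True)

# Keep only the upper triangle so each pair appears once and the diagonal is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# The 0.8 cutoff is an arbitrary illustrative threshold.
high_pairs = upper.stack()
high_pairs = high_pairs[high_pairs.abs() > 0.8]
print(high_pairs)
```

The same pattern should carry over to `sonar_df.corr(numeric_only=True)`, where the resulting Series would list each strongly correlated column pair once.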
Let's exploit the correlation through PCA!!
But first, we must check the scales!!
sns.catplot(data=sonar_df, kind="box", aspect=3)
plt.show()
The SCALES are NOT the same across columns, so we need to standardize.
sonar_features = sonar_df.select_dtypes("number").copy()
Xsonar = StandardScaler().fit_transform(sonar_features)
Xsonar
array([[-0.39955135, -0.04064823, -0.02692565, ..., 0.06987027,
0.17167808, -0.65894689],
[ 0.70353822, 0.42163039, 1.05561832, ..., -0.47240644,
-0.44455424, -0.41985233],
[-0.12922901, 0.60106749, 1.72340448, ..., 1.30935987,
0.25276128, 0.25758223],
...,
[ 1.00438083, 0.16007801, -0.67384349, ..., 0.90652575,
-0.03913824, -0.67887143],
[ 0.04953255, -0.09539176, 0.13480381, ..., -0.00759783,
-0.70402047, -0.34015415],
[-0.13794908, -0.06497869, -0.78861924, ..., -0.6738235 ,
-0.29860448, 0.99479044]])
sns.catplot(data=pd.DataFrame(Xsonar, columns=sonar_features.columns), kind="box", aspect=3)
plt.show()
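The boxplots confirm the rescaling visually. A numeric check is also possible: after standardizing, each column should have mean 0 and standard deviation 1. The sketch below uses synthetic columns on very different scales as a stand-in for the sonar features. The names `raw`, `scaled`, `col_means`, and `col_sds` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic columns on very different scales stand in for the sonar features.
rng = np.random.default_rng(1)
raw = np.column_stack([
    rng.normal(loc=100.0, scale=25.0, size=500),
    rng.normal(loc=0.01, scale=0.002, size=500),
])

scaled = StandardScaler().fit_transform(raw)

# After standardizing, every column has mean 0 and standard deviation 1.
col_means = scaled.mean(axis=0)
col_sds = scaled.std(axis=0)  # population SD (ddof=0), which is what StandardScaler uses
print(col_means.round(6), col_sds.round(6))
```

Note that `pandas` uses the sample standard deviation (`ddof=1`) by default, so `DataFrame.std()` on the scaled values would come out slightly above 1.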
APPLY PCA and return 2 NEWLY Created Variables to support visualization!!
sonar_pca = PCA(n_components=2).fit_transform(Xsonar)
sonar_pca
array([[ 1.92116817, -1.37089312],
[-0.48012458, 7.58638801],
[ 3.8592282 , 6.43986016],
[ 4.59741943, -3.10408888],
[-0.53386761, 1.84984701],
[-1.24701593, 3.78548414],
[ 1.87007312, 2.49551038],
[-2.05769816, 2.3147504 ],
[-1.64556277, 0.25372155],
[-4.28065736, -2.42781795],
[-1.46164351, -6.32305562],
[-2.46394888, -1.2537634 ],
[-3.99546982, 1.64506244],
[ 0.6370814 , -0.63741683],
[-0.10539302, -0.25210417],
[ 2.11242307, 0.59393523],
[ 4.39574903, -2.25749069],
[ 1.43859617, 1.90219042],
[-1.03943408, -3.29436397],
[-1.16485881, 8.59655069],
[ 2.64812566, 1.66803742],
[ 6.23535677, -1.47389049],
[11.23389579, -2.75609298],
[-0.24732176, -4.86351661],
[ 2.65154822, -4.39934635],
[-0.42203896, -7.16826626],
[-3.69919995, 2.49392786],
[-2.90589296, 0.16356259],
[-1.8957691 , 1.49786172],
[-2.38880313, 1.37815246],
[-2.32050849, -1.198227 ],
[-3.50572573, -0.58086138],
[ 0.04322219, 0.36634604],
[ 1.0292047 , 0.06587682],
[-0.68903218, 1.11801579],
[-1.9337308 , 0.63038558],
[ 0.26804541, -3.41912075],
[-2.12333945, -4.50443015],
[-2.58654933, -4.9379112 ],
[ 0.16018513, -3.83652922],
[-0.88614897, -3.47589265],
[-1.59765115, -4.41483069],
[ 1.48360897, -4.35709934],
[ 1.44697712, -2.29112772],
[ 8.48668013, -2.31220758],
[ 1.27737697, 0.93420479],
[ 0.08854593, 2.81051365],
[ 2.06883858, -3.43706227],
[-0.78138299, -1.61233704],
[-0.31856672, -1.11443574],
[ 3.95584037, -4.66138876],
[-2.0951562 , -3.05837102],
[-4.58738357, -1.74537817],
[-2.66763757, -2.46386482],
[-2.57370302, -2.84072656],
[-2.98845891, -3.06711747],
[-3.701205 , -3.04206607],
[-1.61290675, -4.24162035],
[-3.66320186, -4.09758109],
[-3.95156859, -3.95691785],
[-4.28849494, -3.67404719],
[-3.99022937, -2.92534594],
[-3.02795242, -4.61601499],
[-2.06113673, -4.8504963 ],
[ 3.89602028, -5.01129356],
[ 4.66868527, -4.75463343],
[ 4.27249405, -5.52137645],
[ 4.78482642, -4.44051836],
[ 0.02676699, -5.50724361],
[-3.54961304, -3.08357687],
[-3.76737904, -3.78225243],
[-3.59432514, -3.07303315],
[-3.62792705, -2.73985894],
[-3.06532371, -3.174952 ],
[-6.22064858, 1.15407451],
[-5.09714812, 1.82131978],
[-5.34001492, 2.02732079],
[-4.56483724, 2.21245894],
[-5.09218846, 2.4294874 ],
[-5.51236077, 0.72473133],
[-2.97181487, 3.43011963],
[-1.73504223, 3.28751766],
[-1.76972577, 2.62471132],
[-2.32289342, 3.58772744],
[-1.87966015, 3.65487563],
[ 1.61109246, 4.64124374],
[ 0.01326033, 1.5468855 ],
[ 2.29479443, 1.45945475],
[-0.42317649, 3.35109663],
[ 0.29213153, 1.36445907],
[ 0.06419184, 3.47830589],
[ 0.16473305, 2.67225937],
[-1.69516157, -3.03886575],
[-0.24328844, -1.059707 ],
[-1.70203039, -2.61915102],
[ 0.44918644, -2.36052645],
[ 1.21856416, -3.23833273],
[ 6.36782773, 0.73642288],
[ 5.86566105, 6.22201134],
[ 2.32811749, 1.35622522],
[ 0.81340327, 7.90835107],
[ 0.92296306, 7.28755554],
[ 1.32066581, 6.51343355],
[-1.93304948, 5.04964033],
[ 0.89214574, 5.77325882],
[-1.93311548, 4.98293974],
[ 0.16878754, 1.31765581],
[-0.15581201, 2.39484391],
[-1.1262658 , 2.64804865],
[-0.91894902, 1.06356863],
[-1.86588194, 1.58062903],
[ 1.37994883, 6.21905437],
[-0.01613465, 5.26499384],
[-1.61769778, 3.23288493],
[-1.67074261, 4.08643755],
[ 0.0870092 , 4.47552792],
[-2.34873512, 4.3751504 ],
[-2.14290612, 1.8136091 ],
[-3.01477158, 1.36962041],
[-3.54250316, 2.37796744],
[-2.8669399 , -0.1595899 ],
[-3.03420417, 1.5630714 ],
[-3.03508243, 2.89658961],
[-2.21909884, 2.20827759],
[-2.05969086, 4.05921872],
[-1.55005899, 4.7435195 ],
[-1.19864919, 7.38946477],
[-0.06868931, 7.7256891 ],
[-1.40862679, 6.62963748],
[ 0.67776489, 7.96673913],
[ 0.38189881, 8.77442069],
[ 7.62980234, 1.86948666],
[ 6.91257294, 3.10473738],
[ 8.73901258, 2.71886885],
[ 7.29487999, 2.78439377],
[ 9.01527493, 3.31204677],
[ 9.14299203, 2.76810222],
[ 2.58584412, 2.14440232],
[ 4.72295305, 2.60215625],
[ 1.19188425, -1.00996638],
[ 8.01516938, 0.42759914],
[ 5.45084813, 0.98131544],
[ 7.37874929, -0.53429043],
[ 6.75153404, -1.06718757],
[ 5.47649138, -1.67658293],
[ 6.16730829, 0.97947431],
[11.72743348, 0.46614983],
[ 8.60415575, 4.6141592 ],
[-0.12347172, 3.43192712],
[-0.60637782, 0.77910874],
[-3.56763182, 1.33976644],
[-2.38544066, 1.95848913],
[-1.15220818, -0.67390243],
[-3.22063973, 2.6750803 ],
[-4.09421487, 1.95705408],
[-2.89979522, 1.60819509],
[ 3.6264736 , -4.91118104],
[ 5.92788489, -5.43400112],
[ 4.93422552, -4.55292399],
[ 4.7195731 , -4.25764292],
[ 3.41729015, -4.84215461],
[ 6.84896068, -5.60595577],
[-0.83074146, -2.66580792],
[-2.43810895, -0.7604101 ],
[ 3.3151307 , -1.2477695 ],
[-0.56576422, -1.21374219],
[ 4.84375027, -4.1113877 ],
[ 1.09065734, -4.36237998],
[-1.50595678, -3.00786655],
[-1.1813717 , -3.23331227],
[-0.58844272, -3.56534756],
[ 0.66200435, 1.82805944],
[ 0.09440455, -1.55637496],
[-2.41074113, 1.10272446],
[ 0.17077759, 1.29275173],
[-2.01648931, 0.28502622],
[-1.43510345, 2.15770986],
[-2.63642837, -1.32459419],
[-2.37989072, -2.75099781],
[-0.59518874, -1.40836032],
[ 0.17127407, 0.80912805],
[ 4.07344151, -1.32171421],
[ 2.0892225 , -0.39495668],
[ 2.18537714, 0.18559798],
[ 1.72078994, 2.76819856],
[ 0.9634835 , 0.59891783],
[ 5.33447453, -1.83583008],
[ 0.75309666, -2.40472434],
[-0.57828812, -2.91229356],
[-1.6041077 , -1.89033976],
[-1.24218934, -2.49121324],
[-2.04098389, -2.4827779 ],
[-2.3234503 , -2.29857771],
[-1.75482615, -3.38829039],
[-3.14194761, -2.36712914],
[-3.08302426, -1.23110785],
[-3.86592726, -0.58675473],
[-3.61880911, -1.34121736],
[-3.48250759, -1.15015708],
[-3.94549651, -0.70515166],
[-3.13198027, 0.18397254],
[-3.61423572, 0.15117433],
[-1.84562154, -0.88930777],
[-1.20765295, -0.9681736 ],
[-2.97143919, -2.75349246],
[-2.29321041, -2.75544556],
[-3.11446433, -1.85054952],
[-3.23862419, -2.27709396]])
sonar_pca_df = pd.DataFrame(sonar_pca, columns=["pc01", "pc02"])
sonar_pca_df
| | pc01 | pc02 |
|---|---|---|
| 0 | 1.921168 | -1.370893 |
| 1 | -0.480125 | 7.586388 |
| 2 | 3.859228 | 6.439860 |
| 3 | 4.597419 | -3.104089 |
| 4 | -0.533868 | 1.849847 |
| ... | ... | ... |
| 203 | -1.207653 | -0.968174 |
| 204 | -2.971439 | -2.753492 |
| 205 | -2.293210 | -2.755446 |
| 206 | -3.114464 | -1.850550 |
| 207 | -3.238624 | -2.277094 |
208 rows × 2 columns
sonar_pca_df["X60"] = sonar_df.X60
sonar_pca_df
| | pc01 | pc02 | X60 |
|---|---|---|---|
| 0 | 1.921168 | -1.370893 | R |
| 1 | -0.480125 | 7.586388 | R |
| 2 | 3.859228 | 6.439860 | R |
| 3 | 4.597419 | -3.104089 | R |
| 4 | -0.533868 | 1.849847 | R |
| ... | ... | ... | ... |
| 203 | -1.207653 | -0.968174 | M |
| 204 | -2.971439 | -2.753492 | M |
| 205 | -2.293210 | -2.755446 | M |
| 206 | -3.114464 | -1.850550 | M |
| 207 | -3.238624 | -2.277094 | M |
208 rows × 3 columns
sns.relplot(data=sonar_pca_df, x="pc01", y="pc02")
plt.show()
Color by the categorical variable, X60.
sns.relplot(data=sonar_pca_df, x="pc01", y="pc02", hue="X60", palette="Set1")
plt.show()
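When only 2 PCs are kept from many columns, it is reasonable to ask how much of the total variation those 2 new variables capture. A fitted `PCA` object reports this through its `explained_variance_ratio_` attribute. The sketch below is a minimal illustration on synthetic correlated data, NOT the sonar data. The names `latent`, `mixing`, `X_demo`, `pca_demo`, and `ratio` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# ASSUMPTION: synthetic correlated data stands in for the 60 sonar features.
rng = np.random.default_rng(2)
latent = rng.normal(size=(300, 2))   # two hidden drivers
mixing = rng.normal(size=(2, 10))    # spread them across 10 observed columns
X_demo = latent @ mixing + 0.3 * rng.normal(size=(300, 10))

# Standardize first, then fit PCA and keep the fitted object instead of only the scores.
pca_demo = PCA(n_components=2).fit(StandardScaler().fit_transform(X_demo))

# Fraction of the total variance captured by each retained component.
ratio = pca_demo.explained_variance_ratio_
print(ratio, ratio.sum())
```

Applying the same idea to the sonar data would mean fitting `PCA(n_components=2)` on `Xsonar` and keeping the fitted object, since `fit_transform` alone returns only the scores.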